Focusing Web Crawls On Location-Specific Content

Authors

Lefteris Kozanidis

Sofia Stamou

George Spiros

Abstract

Retrieving relevant data for location-sensitive keyword queries is a challenging task that has so far been addressed as a problem of automatically determining the geographical orientation of web searches. Unfortunately, identifying localizable queries is not by itself sufficient for performing successful location-sensitive searches, unless there exists a geo-referenced index of data sources against which localizable queries are searched. In this paper, we propose a novel approach towards the automatic construction of a geo-referenced search engine index. Our approach relies on a geo-focused crawler that incorporates a structural parser and uses GeoWordNet as a knowledge base in order to automatically deduce the geo-spatial information that is latent in the pages’ contents. Based on location-descriptive elements in the page URLs and anchor text, the crawler directs the pages to a location-sensitive downloader. This downloading module resolves the geographical references of the URL location elements and organizes them into indexable hierarchical structures. The location-aware URL hierarchies are linked to their respective pages, resulting in a geo-referenced index against which location-sensitive queries can be answered.
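The pipeline described in the abstract (spot location-descriptive elements in URLs and anchor text, resolve them against a geographic knowledge base, and organize them into hierarchies linked to pages) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `GAZETTEER` dictionary is a hypothetical toy stand-in for GeoWordNet, and the function names `location_candidates`, `hierarchy`, and `build_index`, as well as the example URLs, are assumptions introduced only for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical toy gazetteer standing in for GeoWordNet.
# Each place name maps to its parent place (None for a top-level place).
GAZETTEER = {
    "greece": None,
    "athens": "greece",
    "patras": "greece",
    "italy": None,
    "rome": "italy",
}


def tokens(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def location_candidates(url, anchor_text=""):
    """Return gazetteer place names found in the URL or anchor text, in order."""
    parsed = urlparse(url)
    found = []
    for tok in tokens(parsed.netloc + " " + parsed.path + " " + anchor_text):
        if tok in GAZETTEER and tok not in found:
            found.append(tok)
    return found


def hierarchy(place):
    """Resolve a place to its chain of ancestors, broadest first."""
    chain = []
    while place is not None:
        chain.append(place)
        place = GAZETTEER[place]
    return list(reversed(chain))


def build_index(pages):
    """Map each resolved place hierarchy to the URLs that mention it.

    `pages` is an iterable of (url, anchor_text) pairs.
    """
    index = {}
    for url, anchor in pages:
        for place in location_candidates(url, anchor):
            index.setdefault(tuple(hierarchy(place)), []).append(url)
    return index


if __name__ == "__main__":
    pages = [
        ("http://www.example.com/travel/athens/hotels.html", "Hotels in Athens"),
        ("http://www.example.com/guide.html", "Visit Rome"),
    ]
    for path, urls in build_index(pages).items():
        print(" > ".join(path), urls)
```

A real system would also need to disambiguate place names that coincide with ordinary words or with other places, which this exact-match lookup does not attempt.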


Similar resources

'Oh Web Image, Where Art Thou?'

Web image search today is mostly keyword-based and explores the content surrounding the image. Searching for images related to a certain location quickly shows that Web images typically do not reveal their explicit relation to an actual geographic position. The geographic semantics of Web images are either not available at all or hidden somewhere within the Web pages’ content. Our spatial s...


Geospatial Web Image Mining

One commonly asked question when confronted with a photograph is “Where is this place?” When talking about a place mentioned on the Web, the question arises “What does this place look like?” Today, these questions cannot reliably be answered for Web images as they typically do not explicitly reveal their relationship to an actual geographic position. Analysis of the keywords surrounding the im...


The iCrawl System for Focused and Integrated Web Archive Crawling

The large size of the Web makes it infeasible for many institutions to collect, store and process archives of the entire Web. Instead, many institutions focus on creating archives of specific subsets of the Web. These subsets may be based around specific topics or events. Our iCrawl system provides a focused crawler that is able to automatically collect Web pages relevant to a topic based on co...


Unsupervised Relation Extraction of In-Domain Data from Focused Crawls

This thesis proposal approaches unsupervised relation extraction from web data, which is collected by crawling only those parts of the web that are from the same domain as a relatively small reference corpus. The first part of this proposal is concerned with the efficient discovery of web documents for a particular domain and in a particular language. We create a combined, focused web crawling ...


Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts

Web archives preserve the fast-changing Web by repeatedly crawling its content. The crawling strategy has an influence on the data that is archived. We use link anchor text of two Web crawls created with different crawling strategies in order to compare their coverage of past popular topics. One of our crawls was collected by the National Library of the Netherlands (KB) using a depth-first strat...




Publication date: 2009
